05, July 2019
In Python’s Pandas library, the DataFrame class provides a member function, duplicated(), to find duplicate rows based on all columns or on some specific columns.
Create a DataFrame with some duplicate rows.
import pandas as pd
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')
            ]
# Create a DataFrame object
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
students_df
|   | Name | Age | City |
|---|---|---|---|
| 0 | Vishal | 34 | Sydney |
| 1 | Gaurav | 30 | Delhi |
| 2 | Madhura | 16 | New York |
| 3 | Vishakha | 30 | Delhi |
| 4 | Vishakha | 30 | Delhi |
| 5 | Vishakha | 30 | Mumbai |
| 6 | Madhura | 40 | London |
| 7 | Manavika | 30 | Delhi |
To find & select duplicate rows based on all columns, call DataFrame.duplicated() without any subset argument. It returns a Boolean Series with True at the position of each duplicated row except its first occurrence. Pass this Boolean Series to the [] operator of the DataFrame to select the rows which are duplicates, i.e.
# Select duplicate rows except first occurrence based on all columns
students_df.duplicated()
duplicateRowsDF = students_df[students_df.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except first occurrence based on all columns are :
       Name  Age   City
4  Vishakha   30  Delhi
Here, all duplicate rows except their first occurrence are returned because the default value of the keep argument is 'first'.
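To see what the Boolean Series looks like before it is used for selection, here is a minimal sketch that rebuilds the same sample data and inspects the mask directly. The names mask and dup_count are only illustrative and are not part of the original example.

```python
import pandas as pd

# Same sample data as in the example above
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')]
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# Boolean mask: True only where a row repeats an earlier row in every column
mask = students_df.duplicated()
print(mask.tolist())

# Summing the mask counts the duplicate rows, since True counts as 1
dup_count = int(mask.sum())
print("Number of duplicate rows:", dup_count)
```

Only the row at index 4 is marked True, since index 5 differs in the City column and so is not a full duplicate.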
If we want to select all duplicate rows except their last occurrence, we need to pass the keep argument as 'last', i.e.
# Select duplicate rows except last occurrence based on all columns
duplicateRowsDF = students_df[students_df.duplicated(keep='last')]
print("Duplicate Rows except last occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except last occurrence based on all columns are :
       Name  Age   City
3  Vishakha   30  Delhi
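Neither 'first' nor 'last' returns every copy of a duplicated row. As a hedged sketch, the keep argument of duplicated() also accepts False, which marks all occurrences as duplicates. The variable name all_duplicates_df is only illustrative.

```python
import pandas as pd

# Same sample data as in the example above
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')]
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# keep=False flags every occurrence of a duplicated row, including the first
all_duplicates_df = students_df[students_df.duplicated(keep=False)]
print("All occurrences of duplicate rows based on all columns are :")
print(all_duplicates_df)
```

With this data, both rows at index 3 and index 4 are selected, which can be handy when the duplicates need to be reviewed side by side before any of them is dropped.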
To compare rows & find duplicates based on selected columns only, pass a list of column names as the subset argument of DataFrame.duplicated(). It will then mark rows as duplicates based on these columns only, ignoring the others.
Find & select duplicate rows based on a single column:
duplicateRowsbyCol = students_df[students_df.duplicated(['Name'])]
print("Duplicate Rows based on a single column are:", duplicateRowsbyCol, sep='\n')
Duplicate Rows based on a single column are:
       Name  Age    City
4  Vishakha   30   Delhi
5  Vishakha   30  Mumbai
6   Madhura   40  London
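The subset argument also accepts more than one column name. The following sketch, which assumes the same sample data, compares rows on 'Age' and 'City' together. The variable name duplicateRowsbyCols is illustrative and simply follows the naming used above.

```python
import pandas as pd

# Same sample data as in the example above
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')]
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# A row is a duplicate if an earlier row has the same Age AND the same City
duplicateRowsbyCols = students_df[students_df.duplicated(subset=['Age', 'City'])]
print("Duplicate Rows based on multiple columns are:", duplicateRowsbyCols, sep='\n')
```

Here the rows at index 3, 4 and 7 are selected, because each repeats the (30, 'Delhi') pair first seen at index 1, even though the names differ.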